Adventures in Data Science

Overview

This tutorial covers the basics of using the Command Line and Git to track and record changes to files on your local computer. It provides background information that will help you to better understand the concepts that we will discuss in class and to better participate in the hands-on portion of the course.

Working with the Command Line

Most users interact with their computer through a Graphical User Interface (GUI) that allows them to use a mouse, keyboard, and graphical elements on screen (such as file menus, pictures of folders and files, etc.) to perform their work. Users tend to conflate their Operating System and their GUI because computer hardware and software manufacturers tightly package these two concerns as a convenience to users. But the Windows 10 or Mac Big Sur operating system that makes your computer work and the Windows 10 or Mac Big Sur GUI that you interact with are, in fact, completely different and separable software packages, and it is possible to use different methods/software to interact with your computer than the stock, tightly coupled GUI that launches automatically when you turn on your computer.

Because companies like Microsoft and Apple devote so many resources to the development of their system GUIs, there are few viable (at present, no commercially available) competing GUIs for these platforms. This is not the case in the Linux world, however, where users have several system GUI packages from which to choose and can seamlessly switch between them as desired. Despite the lack of competition/choice on the GUI front when it comes to interacting with your computer, there are other, non-graphical ways of communicating directly with your operating system that exist for all operating systems. We call these “Command Line” interfaces. The Command Line offers a text-only, non-graphical means of interacting with your computer. In the early days of computing, all user interaction with the computer happened at the command line. In the current days of graphical user interfaces, using the Command Line requires you to launch a special program that provides Command Line access.

Mac users will use an application called “Terminal” which ships by default with the Mac operating system. To launch the Terminal application, go to:

Applications -> Utilities -> Terminal

When you launch the application, you will see something like this:

Windows users will use an application called Git Bash, which was installed on your system when you installed Git. To launch Git Bash, go to:

Click on the Windows Start Menu and search for “Git Bash”

Alternatively,

Click on the Windows Start Menu, select Programs, and browse to Git Bash

When you launch the application, you will see something like this:

Interacting with the Command Line

While it can look intimidating to those raised on the GUI, working with the Command Line is actually quite simple. Instead of pointing and clicking on things to make them happen, you type written commands.

The figure below shows a new, empty Command Line Interface in the Mac Terminal application

The Command Line prompt contains a lot of valuable information. The beginning of the line, “(base) MacPro-F5KWP01GF694”, tells us exactly which computer we are communicating with. This may seem redundant, but it is actually possible to interact with computers other than the one you are typing on by connecting to them via the Command Line over the network.

The bit of information after the colon, in this example the “~” character, tells us where in the computer’s filesystem we are. We’ll learn more about this later; for now, you need to understand that the “~” character means that you are in your home directory.
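You can check what “~” stands for on your own machine by typing the following at the Command Line:

```shell
# Print the full path that "~" expands to -- your home directory
echo ~

# Jump back to your home directory from anywhere, then confirm where you are
cd ~ && pwd
```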

The next piece of information we are given is the username under which we are logged into the computer, in this case, my local username, “cstahmer”.

After the username, we see the “$” character. This is known as the Command Prompt. It is an indicator that the Command Line application is waiting for you to enter something. The Command Prompt character is used throughout these materials when giving command examples. When working through the materials, DO NOT TYPE the Command Prompt character. It will already be there, telling you that the computer is ready to receive your command.

Depending on your system and/or Command Line interface, you may or may not also see a solid or flashing box that appears after the Command Prompt. This is a Cursor Position Indicator, which tells you where the cursor currently sits in the terminal. This is useful if you need to go back and correct an error. Generally speaking, you can’t click a mouse in a terminal app to edit text. You need to use your keyboard’s right and left arrow keys to move the cursor to the correct location and then make your edit.

As noted earlier, we interact with the Command Line by typing commands. The figure below shows an example of a simple command, “echo” being entered into the Command Line.

The “echo” command prints back to screen any text that you supply to the command; it literally echoes your text. To execute this or any command, you simply hit the “return” or “enter” key on your keyboard. You’ll see that when you execute a Command Line command, the system performs the indicated operation, prints any output from the operation to screen, and then delivers a new Command Line prompt.
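For example, typing the following at the prompt and pressing return prints the supplied text back to the screen:

```shell
# echo prints its arguments back to the screen
echo "Hello, world"
# prints: Hello, world
```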

Note that depending on your particular system and/or Command Line interface, things might look slightly different on your computer. However, the basic presentation and function as described above will be the same.

Common Command Line Commands

During our hands-on, in-class session we will practice using the following Command Line commands. Keep this page open as a reference during class to make things easier.

Command   Name                      Function
ls        List                      Lists all files in the current directory.
ls -l     List with Long flag       Lists additional information about each file.
ls -a     List with All flag        Lists all files, including hidden files.
pwd       Print Working Directory   Prints the current working directory.
mkdir     Make Directory            Creates a new file directory.
cd        Change Directory          Navigates to another directory on the file system.
mv        Move                      Moves or renames files.
cp        Copy                      Copies files.
rm        Remove/delete             Deletes files.
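The short session below exercises most of the commands in the table. The directory and file names are invented for illustration, and the session starts in a temporary scratch directory so it won’t disturb your own files:

```shell
cd "$(mktemp -d)"          # start in a fresh, temporary scratch directory
pwd                        # print the current working directory
mkdir projects             # create a new directory
cd projects                # move into it
touch notes.txt            # create an empty file to work with (touch is not in the table)
cp notes.txt backup.txt    # copy the file
mv backup.txt old.txt      # move (rename) the copy
ls                         # list the files: notes.txt old.txt
ls -l                      # long listing with permissions, sizes, and dates
ls -a                      # all files, including hidden "dot" files
rm old.txt                 # delete the copy
```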

For a more complete list of Unix Commands, see the Unix Cheat Sheet.

Command Line Text Editors

The Command Line also features a variety of different text editors, similar in nature to Microsoft Word or Mac Pages but much more stripped down. These editors are only accessible from the Command Line; we won’t spend very much time with them, but it is important to know how to use them so that you can open, read, and write directly in the Command Line window.

Macs and Git Bash both ship with a text editor called Vim (other common editors include Emacs and Nano). To open a file with vim, type vi in a Command Line window, followed by the filename. If you want to create a new file, simply type the filename you’d like to use for that file after vi.

Vim works a bit differently than other text editors and word processors. It has a number of ‘modes,’ which provide different forms of interaction with a file’s data. We will focus on two modes: Normal mode and Insert mode. When you open a file with Vim, the program starts in Normal mode. This mode is command-based and, somewhat strangely, it doesn’t let you insert text directly in the document (the reasons for this have to do with Vim’s underlying design philosophy: on the Command Line, we edit text more than we write it).

To insert text in your document, switch to Insert mode by pressing i. You can check whether you’re in Insert mode by looking at the bottom left hand portion of the window, which should read -- INSERT --.

Once you are done inserting text, pressing ESC (the Escape key) will bring you back to Normal mode. From here, you can save and quit your file, though these actions differ from other text editors and word processors: saving and quitting with Vim works through a sequence of key commands (or chords), which you enter from Normal mode.

To save a file in Vim, make sure you are in Normal mode and then enter :w. Note the colon, which must be included. After you’ve entered this key sequence, in the bottom left hand corner of your window you should see “[filename] XL, XC written” (L stands for “lines” and C stands for “characters”).

To quit Vim, enter :q. This should take you back to your Command Line and, if you have created a new file, you will now see that file in your window.

If you don’t want to save the changes you’ve made in a file, you can toss them out by typing :q! in place of :w and then :q. Also, in Vim, key sequences for save, quit, and hundreds of other commands can be chained together. For example, instead of separately inputting :w and :q to save and quit a file, you can use :wq, which will produce the same effect. There are dozens of base commands like this in Vim, and the program can be customized far beyond what we need for our class. More information about this text editor can be found here.

Basic Vim Commands

Command Function
esc Enter Normal mode.
i Enter Insert mode.
:w Save.
:q Quit.
:q! Quit without saving.

For a more complete list of Vim commands, see this Cheat Sheet.

Introduction to Version Control

This section covers the basics of using Version Control Software (VCS) to track and record changes to files on your local computer. It provides background information that will help you to better understand what VCS is, why we use it, and how it does its work.

What is Version Control?

Version control describes a process of storing and organizing multiple versions (or copies) of documents that you create. Approaches to version control range from simple to complex and can involve the use of various human workflows and/or software applications to accomplish the overall goal of storing and managing multiple versions of the same document(s).

Most people have a folder/directory somewhere on their computer that looks something like this:

Or perhaps, this:

This is a rudimentary form of version control that relies completely on the human workflow of saving multiple versions of a file. This system works minimally well, in that it does provide you with a history of file versions theoretically organized by their time sequence. But this filesystem method provides no information about how the file has changed from version to version, why you might have saved a particular version, or specifically how the various versions are related. This human-managed filesystem approach is more subject to error than software-assisted version control systems. It is not uncommon for users to make mistakes when naming file versions, or to go back and edit files out of sequence. Software-assisted version control systems (VCS) such as Git were designed to solve this problem.

Software Assisted Version Control

Version control software has its roots in the software development community, where it is common for many coders to work on the same file, sometimes synchronously, amplifying the need to track and understand revisions. But nearly all types of computer files, not just code, can be tracked using modern version control systems. IBM’s OS/360 IEBUPDTE software update tool is widely regarded as the earliest precursor to modern version control systems. The 1972 release of the Source Code Control System (SCCS), developed at Bell Labs, marked the first fully fledged system designed specifically for software version control.

Today’s marketplace offers many options when it comes to choosing a version control software system. They include systems such as Git, Visual SourceSafe, Subversion, Mercurial, CVS, and Plastic SCM, to name a few. Each of these systems offers its own twist on version control, differing sometimes in the area of user functionality, sometimes in how it handles things on the back-end, and sometimes both. This tutorial focuses on the Git VCS, but in the sections that follow we offer some general information about classes of version control systems to help you better understand how Git does what it does and to help you make more informed decisions about how to deploy it for your own work.

Local vs Server Based Version Control

There are two general types of version control systems: Local and Server (sometimes called Cloud) based systems. When working with a Local version control system, all files, metadata, and everything associated with the version control system live on your local drive in a universe unto itself. Working locally is a perfectly reasonable option for those who work independently (not as part of a team), have no need to regularly share their files or file versions, and who have robust back-up practices for their local storage drive(s). Working locally is also sometimes the only option for projects involving protected data and/or proprietary code that cannot be shared.

Server based VCS utilize software running on your local computer that communicates with a remote server (or servers) that store your files and data. Depending on the system being deployed, files and data may reside exclusively on the server and are downloaded to temporary local storage only when a file is being actively edited. Or, the system may maintain continuous local and remote versions of your files. Server based systems facilitate team science because they allow multiple users to have access to the same files, and all their respective versions, via the server. They can also provide an important, non-local back-up of your files, protecting you from loss of data should your local storage fail.

Git is a free Server based version control system that can store files both locally and on a remote server. While the sections that follow offer a broader description of Server based version control, in this workshop we will focus only on using Git locally and will not configure the software to communicate with, store files on, or otherwise interact with a remote server. DataLab’s companion “Git for Teams” workshop focuses on using Git with the GitHub cloud service to capitalize on Git’s distributed version control capabilities.

Server based version control systems can generally be segmented into two distinct categories: 1) Centralized Version Control Systems (Centralized VCS) and 2) Distributed Version Control Systems (Distributed VCS).

Central Version Control Systems

Centralized VCS is the oldest and, surprisingly to many, still a widely deployed form of version control architecture worldwide. Centralized VCS implement a “spoke and wheel” architecture to provide server based version control.

With the spoke and wheel architecture, the server maintains a centralized collection of file versions. Users utilize version control clients to “check-out” a file of interest to their local file storage, where they are free to make changes to the file. Centralized VCS typically restrict other users from checking out editable versions of a file if another user currently has the file checked out. Once the user who has checked out the file has finished making changes, they “check-in” their new version, which is then stored on the server from where it can be retrieved and “checked-out” by another user. As can be seen, Centralized VCS provide a very controlled and ordered universe that ensures file integrity and tracking of changes. However, this regulation comes at a cost. Namely, it reduces the ease with which multiple users can work simultaneously on the same file.

Distributed Version Control Systems

Distributed VCS are not dependent on a central repository as a means of sharing files or tracking versions. Distributed VCS implement a network architecture (as opposed to the spoke and wheel of the Centralized VCS as pictured above) to allow each user to communicate directly with every other user.

In Distributed VCS, each user maintains their own version history of the files being tracked, and the VCS software communicates between users to keep the various local file systems in sync with each other. With this type of system, the local versions of two different users will diverge from each other if both users make changes to the file. This divergence will remain in place until the local repositories are synced, at which time the VCS stitches (or merges) the two different versions of the file into a single version that reflects the changes made by each individual, and then saves the stitched version of the file onto both systems as the current version. Various mechanisms can then be used to resolve the conflicts that may arise during this merge process. Distributed VCS offer greater flexibility and facilitate collaborative work, but a lack of understanding of the sync/merge workflow can cause problems. It is not uncommon for a user to forget to sync their local repository with the repositories of other team members and, as a result, work for extended periods of time on outdated files that don’t reflect their teammates’ changes, resulting in work inefficiencies and merge challenges.

The Best of Both Worlds

An important feature of Distributed VCS is that many users and organizations choose to include a central server as a node in the distributed network. This creates a hybrid universe in which some users sync directly with each other while other users sync through a central server.

Syncing with a cloud-based server provides an extra level of backup for your files and also facilitates communication between users. But treating the server as just another node on the network (as opposed to a centralized point of control) puts control and flexibility back in the hands of the individual developer. For example, in a true Centralized VCS, if the server goes down then nobody can check files in and out of the server, which means that nobody can work. In a Distributed VCS this is not an issue: users can continue to work on local versions, and the system will sync any changes when the server becomes available. Git, which is the focus of this tutorial, is a Distributed VCS. You can use Git to share and sync repositories directly with other users or through a central Git server such as GitHub or GitLab.

VCS and the Computer File System

When we think about Version Control, we typically think about managing changes to individual files. From the user perspective, the File is typically the minimum accessible unit of information. Whether working with images, tabular data, or written text, we typically use software to open a File that contains the information we want to view or edit. As such, it comes as a surprise to most users that the concept of Files, and their organizing containers (Folders or Directories), are not intrinsic to how computers themselves store and interact with data. In this section of the tutorial we will learn about how computers store and access information and how VCS interact with this process to track and manage files.

How Computers Store and Access Information

For all of their computing power and seeming intelligence, computers still only know two things: 0 and 1. In computer speak, we call this a binary system, and the unit of memory on a hard disk, flash drive, or computer chip that stores each 1 or 0 is called a bit. You can think of your computer’s storage device (regardless of what kind it is) as presenting a large grid, where each box is a bit:

In the above example, as with most computer storage, the bits in our storage grid are addressable, meaning that we can designate a particular bit using a row and column number such as, for example, A7, or E12. Also, remember, that each bit can only contain one of two values: 0 or 1. So, in practice, our storage grid would actually look something like this:

All of the complex information that we store in the computer is translated to this binary language prior to storage using a character encoding system such as Unicode. You can think of Unicode as a codebook that assigns a unique combination of ones and zeros (8, 16, or 32 bits, depending on the encoding and the character) to each letter, numeral, or symbol. For example, the 8-bit Unicode for the upper case letter “A” is “01000001”, and the 8-bit Unicode for the digit “3” is “00110011”. The above grid actually spells out the phrase, “Call me Ishmael”, the opening line of Herman Melville’s novel Moby Dick.

An important aspect of how computers store information in binary form is that, unlike most human-readable forms of data storage, there is no right-to-left, up-or-down, or any other regularized organization of bits on a storage medium. When you save a file on your computer, the computer simply looks for any open bits and starts recording information. The net result is that the contents of a single file are frequently randomly interleaved with data from other files. This mode of storage is used because it maximizes the use of open bits on the storage device. But it presents the singular problem of not making data readable in a regularized, linear fashion. To solve this problem, all computers reserve a particular part of their internal memory for a “Directory” which stores a sector map of all chunks of data. For example, if you create a file called README.txt with the word “hello” in it, the computer would randomly store the Unicode for the five characters in the word “hello” on the storage device and make a directory entry something like the following:

Understanding the Directory concept and how computers store information is crucial to understanding how VCS manage your Files.

How VCS Manage Your Files

Most users think about version control as a process of managing files. For example, I might have a directory called “My Project” that holds several files related to the project, as follows:

One approach to managing changes to the above project files would be to store multiple versions of each file as in the figure below for the file analysis.r:

In fact, many VCS do exactly this. They treat each file as the minimum unit of data and simply save various versions of each file along with some additional information about the version. This approach can work reasonably well. However, it has limitations. First, this approach can unnecessarily consume space on the local storage device, especially if you are saving many versions of a very large file. It also has difficulty dealing with changes in filenames, typically treating the same file with a new name as a completely new file, thereby breaking the chain of version history.

To combat these issues, good VCS don’t actually manage files at all. They manage Directories. Distributed VCS like Git take this alternate approach to data storage that is Directory, rather than file, based.

Graph-Based Data Management

Git (and many other Distributed VCS) manage your files as collections of data rather than collections of files. Git’s primary unit of management is the “Repository,” or “Repo” for short, which is aligned with your computer’s Directory/Folder structure. Consider, for example, the following file structure:

Here we see the home directory of a user, Tom, which contains three subdirectories (Data, Thesis, and Tools) and one file (Notes.txt). Both the Data and Tools directories contain sub-files and/or directories. If Tom wanted to track changes to the two files in the Data directory, he would first create a Git repository by placing the Data directory “under version control.”

When a repository is created, the Git system writes a collection of hidden files into the Data Directory that it uses to store information about all of the data that lives under that directory. This includes information about the addition, renaming, and deletion of both files and folders as well as information about changes to the data contained in the files themselves. Additions, deletions and versions of files are tracked and stored not as copies of files, but rather as a set of instructions that describes changes made to the underlying data and the directory structure that describes them.
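You can see this hidden bookkeeping for yourself. The sketch below creates a throwaway directory (standing in for Tom’s Data directory), places it under version control, and lists its contents; the .git entry in the listing is the hidden collection of files described above. This assumes Git is installed on your machine:

```shell
cd "$(mktemp -d)"   # a temporary scratch directory standing in for "Data"
git init -q         # place the directory under version control (quietly)
ls -a               # the listing now includes the hidden .git directory
```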

Additional Resources

The Git Book is the definitive Git resource and provides an excellent reference for everything that we will cover in the interactive session. There is no need to read the book prior to the session, but it’s a good reference resource to have available as you begin to work with Git after the workshop.

Introduction to Git

Git is a free, open-source Distributed VCS and one of the most widely used version control systems in the world. The sections that follow walk through the core Git workflow on your local computer: creating a repository, checking its status, staging and committing files, reviewing your commit history, undoing mistakes, and working with branches.

Save, Stage, Commit

Git does not automatically preserve versions of every “saved” file. When working with Git, you save files as you always do, but this has no impact on the versions that are preserved in the repository. To create a “version”, you must first add saved files to a Staging area and then “Commit” your staged files to the repository. The Commits that you make constitute the versions of files that are preserved in the repository.
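The full save, stage, commit cycle looks like this in practice. The repository name, file name, and inline user identity below are made up for illustration, and the sketch runs in a temporary scratch directory:

```shell
cd "$(mktemp -d)"
git init -q demo && cd demo               # create and enter a throwaway repository
echo "first draft" > notes.txt            # "save": an ordinary file write; Git records nothing yet
git add notes.txt                         # "stage": mark the saved file for the next version
git -c user.name=Demo -c user.email=demo@example.com \
    commit -q -m 'First version of notes.txt'   # "commit": preserve the staged snapshot
git log --oneline                         # one line per preserved version
```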

Creating Your First Repo

Move to your Home directory

$ cd ~

note: The $ character represents your command prompt. DO NOT type it into your terminal.

Create a new directory for this workshop

$ mkdir introtogit

Change to the new directory

$ cd introtogit

Put the new directory under version control

$ git init

Checking the Status of a Repo

To check the status of a repository, use the following command

$ git status
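In a fresh repository, the status report changes as soon as you add a file. The --short flag (a compact variant of the plain command) makes the change easy to see; the file name below is invented for illustration:

```shell
cd "$(mktemp -d)" && git init -q   # a throwaway repository
git status                         # reports a clean working tree: nothing to commit
echo "hello" > new.txt
git status --short                 # prints: ?? new.txt  ("??" marks an untracked file)
```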

Committing a Version of a File

In Gitspeak, we ‘commit’ a file to the repository to save a copy of its current working state as a version. This is a multi-step process in which we first ‘stage’ the file to be committed and then ‘commit’ the file.

STEP 1: Place the file you want to version into the Staging Area

$ git add <filename>

Replace <filename> in the command above with the actual name of the file you want to version.

STEP 2: Commit Staged Files

$ git commit -m 'A detailed comment explaining the nature of the version being committed. Do not include apostrophes in your comment.'

View a History of Your Commits

To get a history of commits

$ git log

To see commit history with patch data (insertions and deletions) for a specified number of commits

$ git log -p -2

To see abbreviated stats for the commit history

$ git log --stat

You can save a copy of your Git log to a text file with the following command:

$ git --no-pager log > log.txt

Comparing Commits

$ git diff <commit> <commit>

Comparing Files

$ git diff <commit> <file>

or

$ git diff <commit>:<file> <commit>:<file>
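A minimal sketch of both comparisons, using a throwaway repository with two commits (the file name and commit messages are invented for illustration):

```shell
cd "$(mktemp -d)" && git init -q
echo "one" > data.txt && git add data.txt
git -c user.name=Demo -c user.email=demo@example.com commit -q -m 'first'
echo "two" > data.txt
git -c user.name=Demo -c user.email=demo@example.com commit -q -a -m 'second'
git diff HEAD^ HEAD           # everything that changed between the two commits
git diff HEAD^ -- data.txt    # the same comparison, restricted to one file
```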

To View an Earlier Commit

$ git checkout <commit>

To resolve the resulting Detached HEAD state, either RESET HEAD as described below or simply check out another branch

$ git checkout <branch>

To save this older version as a parallel branch execute

$ git checkout -b <new_branch_name>

This will save the older commit as a new branch running parallel to master.

Undoing Things

One of the common undos takes place when you commit too early and possibly forget to add some files, or you mess up your commit message. If you want to redo that commit, make the additional changes you forgot, stage them, and commit again using the --amend option

$ git commit --amend

To unstage a file for commit use

$ git reset HEAD <file>

Throwing away changes you’ve made to a file

$ git checkout -- <file>

Rolling everything back to the last commit

$ git reset --hard HEAD

Rolling everything back to the next to last commit (The commit before the HEAD commit)

$ git reset --hard HEAD^

Rolling everything back to two commits before the HEAD (note the tilde: HEAD~2 means “two commits before HEAD,” whereas HEAD^2 means the second parent of a merge commit)

$ git reset --hard HEAD~2

Rolling everything back to an identified commit using HASH/ID from log

$ git reset --hard <commit>
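A sketch of the simplest case, rolling a throwaway repository back one commit; remember that reset --hard discards uncommitted work permanently:

```shell
cd "$(mktemp -d)" && git init -q
echo "v1" > file.txt && git add file.txt
git -c user.name=Demo -c user.email=demo@example.com commit -q -m 'v1'
echo "v2" > file.txt
git -c user.name=Demo -c user.email=demo@example.com commit -q -a -m 'v2'
git reset --hard HEAD^     # roll everything back to the previous commit
cat file.txt               # prints: v1
```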

When Things go Wrong!

To reset everything back to an earlier commit and make sure that the HEAD pointer is pointing to the newly reset HEAD, do the following

$ git reset --hard <commit>
$ git reset --soft HEAD@{1}

Git Branching

Branching provides a simple way to maintain multiple, side-by-side versions of the files in a repository. Conceptually, branching a repository creates a copy of the codebase in its current state that you can work on without affecting the primary version from which it was copied. This allows you to work down multiple paths without affecting the main (or other) codebase.

To see a list of branches in your repository

$ git branch

To create a new branch

$ git checkout -b hotfix

New branches are created off the current working branch. To change branches use

$ git checkout <branch name>
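Putting the three commands together in a throwaway repository (the branch name is invented for illustration):

```shell
cd "$(mktemp -d)" && git init -q
echo "x" > f.txt && git add f.txt
git -c user.name=Demo -c user.email=demo@example.com commit -q -m 'init'
git checkout -b hotfix     # create "hotfix" off the current branch and switch to it
git branch                 # lists all branches; the asterisk marks the current one
git checkout -             # "-" switches back to the previously checked-out branch
```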

Merging Branches

When you merge a branch, Git folds any changes that you made to files in an identified branch into the current working branch. It also adds any new files. When you perform a merge, a new commit will be automatically created to track the merge. To merge branches, commit any changes to the branch you want to merge (in this example, the ‘hotfix’ branch), then check out the branch into which you want to merge (for example, master), and then execute a merge command.

$ git commit -m 'committing staged files in hotfix branch'
$ git checkout master
$ git merge hotfix

Branching Workflows

Introduction to R: Part 1

Learning objectives

After this lecture, you should be able to:

  • define reproducible research and the role of programming languages
  • explain what R and RStudio are, how they relate to each other, and identify the purpose of the different RStudio panes
  • create and save a script file for later use; use comments to annotate
  • solve simple mathematical operations in R
  • create variables and dataframes
  • inspect the contents of vectors in R and manipulate their content
  • subset and extract values from vectors
  • use the help function

Before We Start

What are R and RStudio? “R” is both a free and open source programming language designed for statistical computing and graphics, and the software for interpreting the code written in the R language. RStudio is an integrated development environment (IDE) within which you can write and execute code, and interact with the R software. It’s an interface for working with the R software that allows you to see your code, plots, variables, etc. all on one screen. This functionality can help you work with R, connect it with other tools, and manage your workspace and projects. You cannot run RStudio without having R installed. While RStudio is a commercial product, the free version is sufficient for most researchers.

Why learn R? There are many advantages to working with R.

  • Scientific integrity. Working with a scripting language like R facilitates reproducible research. Having the commands for an analysis captured in code promotes transparency and reproducibility. Someone using your code and data should be able to exactly reproduce your analyses. An increasing number of research journals not only encourage, but are beginning to require, submission of code along with a manuscript.
  • Many data types and sizes. R was designed for statistical computing and thus incorporates many data structures and types to facilitate analyses. It can also connect to local and cloud databases.
  • Graphics. R has built-in plotting functionalities that allow you to adjust any aspect of your graph to effectively tell the story of your data.
  • Open and cross-platform. Because R is free, open-source software that works across many different operating systems, anyone can inspect the source code, and report and fix bugs. It is supported by a large community of users and developers.
  • Interdisciplinary and extensible. Because anyone can write and share R packages, it provides a framework for integrating approaches across domains, encouraging innovation.

Navigating the interface

  • Source is your script. You can save this as a .R file and re-run to reproduce your results.
  • Console - this is where you run the code. You can type directly here, but it won’t save anything entered here when you exit RStudio.
  • Environment/history lists all the objects you have created and the commands you have run.
  • Files/plots/packages/help/viewer pane is useful for locating files on your machine to read into R, inspecting any graphics you create, seeing a list of available packages, and getting help.

To interact with R, compose your code in the script (source) pane and execute (run) it to send commands to the console. (Shortcut: Ctrl + Enter on Windows/Linux, or Cmd + Return on Mac, runs the current line of code.)

Create a script file for today’s lecture and save it to your lecture_4 folder under ist008_2021 in your home directory. (It’s good practice to keep your projects organized. Some suggested sub-folders for a research project might be: data, documents, scripts, and, depending on your needs, other relevant outputs or products such as figures.)

Mathematical Operations

R works by the process of “REPL”: Read-Eval-Print Loop:

  1. R waits for you to type an expression (a single piece of code) and press Enter.
  2. R then reads in your commands and parses them. It reads whether the command is syntactically correct. If so, it will then
  3. evaluate the code to compute a result.
  4. R then prints the result in the console and
  5. loops back around to wait for your next command.

You can use R like a calculator to see how it processes commands. Arithmetic in R follows an order of operations (aka PEMDAS): parentheses, exponents, multiplication and division, addition and subtraction.

7 + 2
7 - 2
244/12
2 * 12

To see the complete order of operations, use the help command:

?Syntax

HELP!

This is just the beginning, and there are lots of resources to help you learn more. R has built-in help files that can be accessed with the ? command, and args() will show you a function’s arguments. You can search across all of the help documentation using the ?? command. (Note: to get help with arithmetic operators you must put the symbol in single or double quotes.) You can view a package’s documentation using packageDescription(“Name”). And you can always ask the community: Google, Stack Overflow (tag [r]), topic-specific mailing lists, and the R-help mailing list. On CRAN, check out the Intro to R manual and the R FAQ. When asking for help, clearly state the problem and provide a reproducible example. R also has a posting guide to help you write questions that are more likely to get a helpful reply. It’s also a good idea to save your sessionInfo() output so you can show others how your machine and session were configured.
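
For example, the help commands mentioned above can be tried directly (a sketch; ? and ?? open pages in the Help pane rather than printing results to the console):

```r
?mean                        # open the help page for the mean() function
args(sum)                    # show sum()'s arguments in the console
??"regression"               # search all installed help files for a topic
?"+"                         # arithmetic operators must be quoted
packageDescription("stats")  # description of an installed package
sessionInfo()                # your R version and loaded packages
```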

Calls

R has many functions (reusable commands) built-in that allow you to compute mathematical operations, statistics, and other computing tasks. Code that uses a function is said to call that function. When you call a function, the values that you assign as input are called arguments. Some functions have multiple parameters and can accept multiple arguments.

log(10)
sqrt(9)
sum(5, 4, 1)
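
Arguments can be matched by position or by name. A small sketch of multi-argument calls (values chosen for illustration):

```r
log(100)                         # natural log by default
log(100, base = 10)              # a named argument changes the base: 2
round(3.14159, digits = 2)       # 3.14
sum(5, 4, 1, NA, na.rm = TRUE)   # na.rm = TRUE tells sum() to ignore missing values: 10
```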

Variables

A variable is a name for a stored value. Variables allow you to reuse the result of a computation, write general expressions (such as ax + b), and break up your code into smaller steps so it’s easier to test and understand. Variable names can contain letters or numbers, but they cannot begin with a number. In general, variable names should be descriptive but concise, and should not use the same name as common (base R) functions, like mean, T, median, sum, etc.

x <- 10
y <- 24
fantastic.variable2 = x
x <- y/2

In R, variables are copy-on-write. When you change a variable (a “write”), R copies the original value rather than modifying it in place, so variables that were assigned from it keep the old value until their assignments are re-run.

x = 13
y = x
x = 16
y

Data Types and Classes

R categorizes data into different types that specify how the object is stored in memory. The typeof() command will return the data type of an object. These types map to how we categorize data in statistics:

  • continuous (real numbers)
  • discrete (integers, or finite number of values)
  • logical (1 or 0, T or F)
  • nominal (unordered categorical values)
  • ordinal (ordered categorical values)
  • graph (network data)
  • character (text data)
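
You can try typeof() on a few values to see these storage types (a sketch; class(), discussed below, can differ from the storage type, as with plain numbers):

```r
typeof(TRUE)    # "logical"
typeof(2L)      # "integer" - the L suffix requests integer storage
typeof(2)       # "double"  - numbers are stored as doubles by default
typeof(3i)      # "complex"
typeof("two")   # "character"
class(2)        # "numeric" - the class of a plain number differs from its type
```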

Perhaps more useful for day-to-day programming is an object’s class, which specifies how it behaves. Classes in R are hierarchical:

  • logical (TRUE, FALSE)
  • integer (2, 4, 7)
  • numeric (double, 2, 3, 5.7)
  • complex (3i)
  • character (“marie curie”, “grace hopper”)
x <- 2
class(x)

y <- "two"
class(y)

class(TRUE)

class(mean)

Vectors

A vector is an ordered collection of values. The elements in a vector must all have the same data type. (Class and type are distinct concepts in general, but for vectors they typically coincide, so you can expect all elements of a vector to share the same class as well.) You can combine, or concatenate, values to create a vector using c().

v <- c(16, 3, 4, 2, 3, 1, 4, 2, 0, 7, 7, 8, 8, 2, 25)
class(v)

place <- c("Mandro", "Cruess", "ARC", "CoHo", "PES", "Walker", "ARC",
  "Tennis Courts", "Library", "Arboretum", "Arboretum", "Disneyland",
  "West Village", "iTea", "MU")
class(place)

What happens if you make a typo or try to combine different data types in the same vector? R resolves this for you and automatically converts elements within the vector to be the same data type. It does so through implicit coercion where it conserves the most information possible (logical -> integer -> numeric -> complex -> character). Sometimes this is very helpful, and sometimes it isn’t.
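
You can see implicit coercion in action (a sketch):

```r
c(TRUE, 2)         # logical -> numeric: 1 2
c(1, 2, "three")   # numeric -> character: "1" "2" "three"
c(16, 3, "4O")     # one typo ("4O" instead of 40) turns every element into text
```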

Basic statistics on vectors

You can use functions built into R to inspect a vector and calculate basic statistics.

length(v)   # returns how many elements are within the object
length(place)
min(v)      # minimum value
max(v)      # maximum value
mean(v)
median(v)
sd(v)       # standard deviation

Matrices, Arrays & Lists

Matrices are two-dimensional containers for values. All elements within a matrix must have the same data type. Arrays generalize vectors and matrices to higher dimensions. In contrast, lists are containers for elements with different data types.
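
A few constructors show the difference (a sketch; the values are made up for illustration):

```r
m <- matrix(1:6, nrow = 2, ncol = 3)   # 2 rows x 3 columns, filled column-wise
dim(m)                                 # 2 3

a <- array(1:24, dim = c(2, 3, 4))     # three-dimensional array
dim(a)                                 # 2 3 4

l <- list(place = "Mandro", time.min = 16, on.time = TRUE)  # mixed types are fine in a list
l$time.min                             # 16
```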

Data Frames

We frequently work with 2-dimensional tables of data. For a tabular data set, typically each row corresponds to a single subject and is called an observation. Each column corresponds to the data measures or responses – a feature or covariate. (Sometimes people will also refer to these as variables, but that can be confusing as “variable” means something else in R, so here we’ll try to avoid that term.) R’s structure for tabular data is the data frame.

A data frame is a list of column vectors. Thus, elements of a column must all have the same type (like a vector), but elements of a row can have different types (like a list). Additionally, every row must be the same length. To make a data frame in R, you can combine vectors using the data.frame() command.

distance.mi <- c(3.1, 0.6, 0.8, 0.2, 0.5, 0.2, 0.7, 0.5, 0, 1.2, 1.2, 501, 1.6,
  0.4, 4.7)
time.min <- v
major <- c("nutrition", "psychology", "global disease", "political science",
  "sociology", "sustainable agriculture", "economics", "political science",
  "undeclared", "psychology", "undeclared","economics","political science",
  "english", "economics")

my.data <- data.frame(place, distance.mi, time.min, major)

Inspecting Data Frames

You can print a small dataset, but it can be slow and hard to read, especially if there are a lot of columns. R has many other functions to inspect objects:

head(my.data)
tail(my.data)
nrow(my.data)
ncol(my.data)
ls(my.data)
rownames(my.data)
str(my.data)
summary(my.data)

Subsetting

Sometimes you will want to work with only specific elements in a vector or data frame. To do that, you can refer to the position of the element, which is also called the index.

length(time.min)
time.min[15]

You can also subset by using the name of an element in a list. The $ operator extracts a named element from a list, and is useful for extracting the columns from data frames.

How can we use subsetting to look only at the distance response?

my.data$distance.mi
my.data[,2]
distances2 <- my.data[["distance.mi"]]
distances3 <- my.data[[2]]
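
Note that these forms return different structures: single brackets keep the data frame wrapper, while $ and double brackets extract the bare column vector. A sketch using a small stand-in data frame (so it runs on its own):

```r
df <- data.frame(place = c("Mandro", "Cruess"), distance.mi = c(3.1, 0.6))

class(df["distance.mi"])                       # "data.frame" - still a one-column data frame
class(df[["distance.mi"]])                     # "numeric"    - the underlying vector
identical(df$distance.mi, df[["distance.mi"]]) # TRUE - $ is shorthand for [[ with a name
```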

What are the responses for political science majors?

polisci_majors <- my.data[which(my.data$major == 'political science'), ]
View(polisci_majors)

which(my.data$major == "political science")
shortframe <- my.data[c(4, 8, 13), ]

What are the majors of the first 5 students who replied?

shortframe2 <- my.data[1:5,"major"]             # range for rows, columns

You can also use $ to create an element within the data frame.

my.data$mpm <- my.data$distance.mi / my.data$time.min

Factors

Factors are the class that R uses to represent categorical data. Levels are the categories of a factor. (Note: since R 4.0, data.frame() stores text columns as character rather than factor by default, so convert the column first.)

my.data$major <- factor(my.data$major)
levels(my.data$major)
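
A sketch of creating and inspecting factors directly (values abbreviated from the survey data above):

```r
major.f <- factor(c("nutrition", "psychology", "psychology", "economics"))
levels(major.f)   # categories, sorted alphabetically: "economics" "nutrition" "psychology"
table(major.f)    # counts per level

# For ordinal data, set the level order explicitly and mark the factor as ordered
size <- factor(c("small", "large", "medium"),
               levels = c("small", "medium", "large"), ordered = TRUE)
size < "large"    # ordered factors support comparisons: TRUE FALSE TRUE
```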